1 Problem at hand

This report attempts to predict whether someone has an income of over 50k USD or below 50k USD a year based on several parameters:

  • age: the age of an individual

  • workclass: a general term to represent the employment status of an individual

  • fnlwgt: final weight. In other words, this is the number of people the census believes the entry represents

  • education: the highest level of education achieved by an individual.

  • education-num: the highest level of education achieved in numerical form.

  • marital-status: marital status of an individual.

  • occupation: the general type of occupation of an individual

  • relationship: represents what this individual is relative to others. For example an individual could be a Husband. Each entry only has one relationship attribute and is somewhat redundant with marital status. We might not make use of this attribute at all

  • race: Descriptions of an individual’s race

  • sex: the biological sex of the individual

  • capital-gain: capital gains for an individual

  • capital-loss: capital loss for an individual

  • hours-per-week: the hours an individual has reported to work per week

  • native-country: country of origin for an individual

Data set is taken from kaggle.com.

2 Structure of report

  • Read data and cleansing
  • Cross validation
  • Check whether data is balanced or not
  • Model #1: Naive Bayes
  • Predict future data with Naive Bayes
  • Model evaluation for Naive Bayes:
  • Disadvantages of Naive Bayes
  • Confusion Matrix
  • ROC/AUC
  • Model #2: Decision Tree
  • Model evaluation for Decision Tree:
  • Confusion Matrix
  • Disadvantages of Decision Tree
  • Pruning
  • (possible) Model #3: Random forest
  • Disadvantages
  • Comparison and final conclusion

3 Read data and cleansing

library(tidyverse)
## ── Attaching packages ────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
income <- read_csv("income_evaluation.csv")
## Parsed with column specification:
## cols(
##   age = col_double(),
##   workclass = col_character(),
##   fnlwgt = col_double(),
##   education = col_character(),
##   `education-num` = col_double(),
##   `marital-status` = col_character(),
##   occupation = col_character(),
##   relationship = col_character(),
##   race = col_character(),
##   sex = col_character(),
##   `capital-gain` = col_double(),
##   `capital-loss` = col_double(),
##   `hours-per-week` = col_double(),
##   `native-country` = col_character(),
##   income = col_character()
## )
head(income)
anyNA(income)
## [1] FALSE
income <- income %>% 
  mutate_if(is.character, as.factor) %>% 
  select(-c(relationship, fnlwgt))
income

4 Cross Validation

library(rsample)
set.seed(100)
idx <- initial_split(data = income, prop = 0.8, strata = income )
test <- testing(idx)
train <- training(idx)

5 Check whether data is balanced or not

prop.table(table(income$income))
## 
##     <=50K      >50K 
## 0.7591904 0.2408096
prop.table(table(test$income))
## 
##     <=50K      >50K 
## 0.7591768 0.2408232
prop.table(table(train$income))
## 
##     <=50K      >50K 
## 0.7591939 0.2408061

Since it is relatively balanced (not like 90:10 or 95:5), we can move on.

6 Model 1: Naive Bayes

library(e1071)
model_naive <- naiveBayes(income ~., train)
#no laplace is needed
pred <- predict(model_naive, test, type = "class")
prob <- predict(model_naive, test, type = "raw")

7 Evaluation of model

7.1 Disadvantages of Naive Bayes model

Naive Bayes are like what its name suggests: naive. Its algorithm thinks that all predictors are independent of each other, which may not always be true. For example, in this data set, there is a column which indicates a person's number of educational years attended as well as their last graduated education degree. Logically, of course we know that someone who graduated with a Master's degree will inevitably have larger number of educational years than someone who graduated with a Bachelor's degree, because in order to graduate with a Master's degree, you need to have a Bachelor's degree first. Hence these predictors are not likely to be completely independent, unlike what this Naive Bayes model had suggested, which may then interfere with our model's reliability in predicting future model.

7.2 Confusion Matrix

library(caret)
confusionMatrix(data = pred, reference = test$income, positive = ">50K")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  4654  918
##      >50K    289  650
##                                         
##                Accuracy : 0.8146        
##                  95% CI : (0.805, 0.824)
##     No Information Rate : 0.7592        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.4126        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.41454       
##             Specificity : 0.94153       
##          Pos Pred Value : 0.69223       
##          Neg Pred Value : 0.83525       
##              Prevalence : 0.24082       
##          Detection Rate : 0.09983       
##    Detection Prevalence : 0.14422       
##       Balanced Accuracy : 0.67804       
##                                         
##        'Positive' Class : >50K          
## 

From this confusion matrix, we can see that the accuracy of the model in predicting future data is 0.8171, whereas the recall/sensitivity metric is 0.4171 and the precision is 0.7025.

In this particular problem, it is hard to see which metric is more useful (eg recall or precision) and hence I will merely look into accuracy. This model has achieved a great accuracy (more than 80%), though it has lacked terribly in terms of recall metric and fared okay in terms of precision metric.

7.3 ROC/AUC

library(ROCR)
roc <- prediction(predictions = prob[,2], labels = as.numeric(ifelse(test$income == ">50K", 1, 0)))

perf <- performance(prediction.obj = roc, measure = "tpr", x.measure = "fpr")
plot(perf)

auc <- performance(roc, "auc")
auc@y.values
## [[1]]
## [1] 0.8563925

One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. Hence since the probability is 0.8563925 (which is close to 1), we can tell that this model has succeeded relatively well in predicting future data.

8 Model #2: Decision Tree

library(partykit)
model_dt <- ctree(formula = income ~., data = train)
plot(model_dt, type = "simple")

pred1 <- predict(model_dt, test, type = "response")
pred2 <- predict(model_dt, train, type = "response")

9 Model evaluation for Decision Tree:

9.1 Confusion Matrix

confusionMatrix(pred1, reference = test$income, positive = ">50K")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  4512  821
##      >50K    431  747
##                                           
##                Accuracy : 0.8077          
##                  95% CI : (0.7979, 0.8172)
##     No Information Rate : 0.7592          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4253          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4764          
##             Specificity : 0.9128          
##          Pos Pred Value : 0.6341          
##          Neg Pred Value : 0.8461          
##              Prevalence : 0.2408          
##          Detection Rate : 0.1147          
##    Detection Prevalence : 0.1809          
##       Balanced Accuracy : 0.6946          
##                                           
##        'Positive' Class : >50K            
## 

Just like what I had said above, in this case, it is not clear which metric between recall and precision to prioritise over as we are merely predicting income ranges, hence we will be looking mainly at accuracy.

From the confusion matrix above alone, we can tell that the accuracy of this model in predicting future data is 0.8076, which is generally quite good. However, it fared very bad in terms of its recall metric (merely scoring 0.4758) and average in terms of its precision metric (0.6338).

9.2 Disadvantages of Decision Tree

One possible disadvantage of decision tree is that its models tend to be over-fitting. This means that it is only able to predict trained data instead of unseen tested data. This is because decision tree established a certain set of rules based on the trained data and apply it to the tested data, which may not follow the same set of rules like what is learnt by the algorithm.

confusionMatrix(data = pred2, reference = train$income, positive = ">50K")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K  >50K
##      <=50K 18156  3148
##      >50K   1621  3125
##                                           
##                Accuracy : 0.8169          
##                  95% CI : (0.8122, 0.8216)
##     No Information Rate : 0.7592          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4539          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4982          
##             Specificity : 0.9180          
##          Pos Pred Value : 0.6584          
##          Neg Pred Value : 0.8522          
##              Prevalence : 0.2408          
##          Detection Rate : 0.1200          
##    Detection Prevalence : 0.1822          
##       Balanced Accuracy : 0.7081          
##                                           
##        'Positive' Class : >50K            
## 

However, looking at the confusion matrix for the prediction of trained data, the values of the three metrics do not differ much from the prediction of tested data. It has slightly better accuracy (0.8131 vs 0.8091), better recall (0.41240 vs 0.39413) and better precision (0.68621 vs 0.67838).

In order to compare even better, we'll take a look at their respective ROC/AUCs.

prob1 <- predict(model_dt, test, type = "prob")
prob2 <- predict(model_dt, train, type = "prob")
roc1 <- prediction(predictions = prob1[,2],
           labels = as.numeric(ifelse(test$income == ">50K", 1, 0)))
roc2 <- prediction(predictions = prob2[,2],
           labels = as.numeric(ifelse(train$income == ">50K", 1, 0)))
roc_perf <- performance(prediction.obj = roc1, measure = "tpr", x.measure = "fpr")
roc_perf2 <- performance(prediction.obj = roc2, measure = "tpr", x.measure = "fpr")
plot(roc_perf)

plot(roc_perf2)

Based on the ROC plot, both looks almost the same.

auc1 <- performance(roc1, "auc")
auc2 <- performance(roc2, "auc")
auc1@y.values
## [[1]]
## [1] 0.8315409
auc2@y.values
## [[1]]
## [1] 0.8469928

From this, we can tell that the AUC values are almost similar too. Hence, it can be said that there is no problem of over-fitting in this decision tree model.

9.3 Pruning

Despite not having an overfitting problem, it is still best that we tune our decision tree model by pruning it. Pruning allows us to determine the p-value before each node branches/splits into another node (mincriterion), the minimum number of observations on each leaf node (minbucket) as well as the minimum number of observations before branching out/splitting (minsplit).

model_dt1 <- ctree(formula = income ~., data = train, control = ctree_control(mincriterion = 0.95,
                                                                   minsplit = 1000,
                                                                   minbucket = 200))
plot(model_dt1, type = "simple")

predfinal <- predict(model_dt1, test, type = "response")
confusionMatrix(predfinal, test$income, positive = ">50K")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction <=50K >50K
##      <=50K  4589  885
##      >50K    354  683
##                                        
##                Accuracy : 0.8097       
##                  95% CI : (0.8, 0.8192)
##     No Information Rate : 0.7592       
##     P-Value [Acc > NIR] : < 2.2e-16    
##                                        
##                   Kappa : 0.4116       
##                                        
##  Mcnemar's Test P-Value : < 2.2e-16    
##                                        
##             Sensitivity : 0.4356       
##             Specificity : 0.9284       
##          Pos Pred Value : 0.6586       
##          Neg Pred Value : 0.8383       
##              Prevalence : 0.2408       
##          Detection Rate : 0.1049       
##    Detection Prevalence : 0.1593       
##       Balanced Accuracy : 0.6820       
##                                        
##        'Positive' Class : >50K         
## 

From the above confusion matrix, we can tell that pruning brings little change to the metrics because from the decision tree model plot, we can tell that most of the initial model's nodes have p values of < 0.001. Hence it is difficult to change the mincriterion parameter and expect a change in the metrics values on the confusion matrix. The huge number of observations on the initial model's nodes also made it hard for us to tweak the minsplit and minbucket to bring about a significant change in the metrics.

10 (possible) Model #3: Random forest

In order to have a better model, it is usually better to use random forest than decision tree. This is because random forest consists of a lot of decision trees. We are also able to prevent the possible problem of overfitting by utilising k-fold cross validation in order to ensure that there are multiple different train datasets and test datasets.

#k-fold cross validation
set.seed(417)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
#5 folds and repeated 3 times
library(randomForest)
#model_rf <- train(income ~., data = train, method = "rf", trControl= ctrl)

10.1 Disadvantages of Random Forest

However, despite all the benefits of random forest, it requires a huge amount of computation and hence because of that, I did not run the above code as my laptop is too old to run it fast enough.

11 Comparison and final conclusion

Both models have their own disadvantages and advantages:

Naive Bayes model: + Quick computation - Treat all predictors to be independent of each other

Decision Tree: + Establish rule clearly and allow predictors to be dependent of each other - Tends to be overfitting

Random Forest: + Accurate, reliable - Heavy computation

I would say for this problem case, the best model would probably be Random Forest. This is because, as I said in my report above, it is likely that some of the predictors in the dataset are dependent of each other, making the Naive Bayes model to be lacking. In addition, I don't think the Decision Tree model is the best in predicting future data as well because Random Forest allows us to overcome the limitations of Decision Tree and predict even more accurate future data through its k-fold cross-validation computation and use of multiple Decision Trees.

With that being said, I still think Naive Bayes and Decision Tree are two very powerful algorithms; though they are better used to predict different things. Naive Bayes can be used for text classification, because it assumes all predictors to be independent (which in this case is good because all words are treated to be independent and hence has equal value of each other), whereas Decision Tree can be used for predicting numerical values because it is able to overcome the need for no multi-collinearity (because it allows its predictors to be dependent of each other) in Linear Regression models.

I will discuss both cases in separate reports. Do check them out.